Comparing distributions may be done with multiple quantile/ECDF plots overlaid.

Read the singers data from Cleveland.

suppressMessages(library(tidyverse))
singer <- readRDS("Cleveland_singer.rds")
  1. Make a quantile plot of the heights of the first bass singers, computing the quantiles manually according to the instructions on p. 21. Overlay a boxplot of the same heights on top of the quantile plot. Then add segments, whose heights are based on boxplot.stats, and filled points in green/blue/red, whose heights are based on linear interpolation of the (f-value, height) pairs (for the linear interpolation, you can use the approx function), as demonstrated in the following plot. The plot was generated with base-R graphics; you may use ggplot or plotly according to your preference.

Explain in words what we see here: what is the connection between the boxplot and the quantile plot?

[5 pt] Extra credit: if you try to generate the above plot for the first soprano, first alto or first tenor, something will be off. Explain exactly what goes wrong.
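The f-value computation and the interpolation step mentioned above can be sketched as follows. This is a minimal sketch, not a solution: it assumes the f-values are defined as (i - 0.5)/n for the i-th ordered observation, and it uses synthetic heights rather than the singer data. The variable names (heights, probs, q_interp, hinges) are made up for illustration.

```r
# Synthetic heights in inches; a stand-in, NOT the Cleveland singer data
set.seed(1)
heights <- round(rnorm(39, mean = 71, sd = 2.5))

x <- sort(heights)                 # ordered observations x_(1) <= ... <= x_(n)
n <- length(x)
f <- (seq_len(n) - 0.5) / n        # assumed f-value of each ordered observation

# Heights at f = 0.25, 0.5, 0.75 by linear interpolation of the (f, x) pairs
probs <- c(0.25, 0.5, 0.75)
q_interp <- approx(x = f, y = x, xout = probs)$y

# Lower hinge, median and upper hinge as computed for the boxplot
hinges <- boxplot.stats(heights)$stats[2:4]

# Quantile plot with dashed segments at the hinge heights and
# filled points at the interpolated heights
plot(f, x, xlab = "f-value", ylab = "Height (synthetic)")
segments(x0 = 0, y0 = hinges, x1 = probs, y1 = hinges, lty = 2)
points(probs, q_interp, pch = 19, col = c("green", "blue", "red"))
```

With this definition of the f-values, the interpolated height at f = 0.5 equals median(heights). The hinges returned by boxplot.stats are computed by a different rule, so they need not coincide with the interpolated heights at f = 0.25 and f = 0.75.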

  1. No matter how narrow the bin width you set, you will not get a histogram where all bars are of height 1 (as we’ve seen happen in the lecture). Why is this happening? For a very narrow bin width, what is the connection between the histogram and the quantile plot? Demonstrate your argument with a plot that overlays the two.

  2. Building on the latter, illustrate, in words and with a plot, the connection between the histogram and the quantile plot of the first bass heights.

  3. Cleveland Ch. 2 recommends comparing pairs of distributions with QQ plots (pp. 21-22) and plotting pairwise QQ plots (p. 24).

    Manually implement the QQ plot for comparing two distributions from p. 21 of Cleveland. In addition, when the number of observations is large, Cleveland suggests on p. 21 creating the plot with fewer quantiles than the number of observations. Your implementation should therefore accept, in addition to the full data from the two distributions, a vector of quantiles on which to perform the comparison (say \(\{0.01, 0.03, ..., 0.97, 0.99\}\)).

    Both the full-data and the manually selected quantile implementations require linear interpolation. You may use the approx() function to this end.
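One possible shape for such an implementation is sketched below. It is a sketch under stated assumptions rather than a reference solution: the function names quantile_at and qq_pairs are hypothetical, the f-values are assumed to be (i - 0.5)/n, the default probabilities are taken to be the f-values of the smaller sample, and rule = 2 in approx() (clamping to the sample extremes when a requested probability falls outside the range of the f-values) is a choice made here, not something prescribed by the text.

```r
# Quantiles of x at the given probabilities, by linear interpolation
# of the (f-value, ordered value) pairs. Hypothetical helper.
quantile_at <- function(x, probs) {
  x <- sort(x)
  f <- (seq_along(x) - 0.5) / length(x)
  # rule = 2: outside [min(f), max(f)] return the nearest extreme (an assumption)
  approx(x = f, y = x, xout = probs, rule = 2)$y
}

# Paired quantiles of two samples, ready for plotting. Hypothetical helper.
qq_pairs <- function(x, y, probs = NULL) {
  if (is.null(probs)) {
    # Default (assumption): the f-values of the smaller sample
    m <- min(length(x), length(y))
    probs <- (seq_len(m) - 0.5) / m
  }
  data.frame(f = probs,
             qx = quantile_at(x, probs),
             qy = quantile_at(y, probs))
}

# Usage on synthetic samples of unequal size
set.seed(2)
a <- rnorm(200, mean = 65, sd = 2)
b <- rnorm(350, mean = 70, sd = 3)
qq <- qq_pairs(a, b, probs = seq(0.01, 0.99, by = 0.02))
plot(qq$qx, qq$qy, xlab = "Quantiles of a", ylab = "Quantiles of b")
abline(0, 1, lty = 2)   # reference line y = x
```

Returning a data frame rather than drawing inside the function keeps the interpolation separate from the plotting, so the same output can be passed to base graphics, ggplot or plotly.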

  4. In class we saw how to utilize segments and symbols in order to generate a dumbbell plot. Recall:

library(plotly)
library(tidyverse)

mpg %>%
  group_by(model) %>%
  summarise(c = mean(cty), h = mean(hwy)) %>%
  mutate(model = forcats::fct_reorder(model, c)) %>%
  plot_ly() %>%
  add_segments(
    x = ~c, y = ~model,
    xend = ~h, yend = ~model, 
    color = I("gray"), showlegend = FALSE
  ) %>%
  add_markers(
    x = ~c, y = ~model,
    color = I("blue"),
    name = "mpg city"
  ) %>%
  add_markers(
    x = ~h, y = ~model, 
    color = I("red"),
    name  = "mpg highway"
  ) %>%
  layout(xaxis = list(title = "Miles per gallon"))

It may be interesting to add another dimension to the plot by mapping another variable to the segments colors.

Modify the above code so that the variable class is mapped to the segments’ color, as is demonstrated in the following:

Note that in the above plot, clicking on a legend entry hides/shows not only a segment, but also the MPG symbols that are associated with it. Hint: look for “Grouped Legend” for the concept and “Subplot Grouped Legend” for another useful piece of information at https://plotly.com/r/legend/. Also, please pay attention to generating the tooltip correctly. Be mindful of the location of the legend.
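The grouped-legend idea from the hint can be tried on a toy example before applying it to mpg. The sketch below uses invented data (the data frame toy and its columns are made up for illustration) and assumes that traces sharing a legendgroup value are toggled together by one legend entry. It deliberately leaves out the tooltip and legend placement.

```r
library(plotly)

# Invented toy data, unrelated to mpg
toy <- data.frame(
  item  = c("a", "b", "c", "d"),
  group = c("g1", "g2", "g3", "g1"),
  lo    = c(1, 2, 3, 2),
  hi    = c(4, 5, 6, 3)
)

p <- plot_ly(toy) %>%
  add_segments(
    x = ~lo, xend = ~hi, y = ~item, yend = ~item,
    color = ~group,            # one segment trace (and legend entry) per group
    legendgroup = ~group
  ) %>%
  add_markers(
    x = ~lo, y = ~item,
    split = ~group,            # one marker trace per group, all in the same color
    legendgroup = ~group,      # tie each marker trace to its group's legend entry
    color = I("blue"),
    showlegend = FALSE         # avoid duplicate legend entries
  )

p
```

Here split is used for the markers because their color is fixed, so something other than color has to break them into one trace per group.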

What are we learning from the additional dimension we added to the plot?

  1. The above plot exposes a problem with our original visualization. What is the problem? Fix it.

Extra credit

[20 pt]

We saw in class that a third variable can be added to a scatter plot of two other variables by drawing a path that follows the ordering of that third variable.

So, starting from the following code:

economics %>%
  arrange(date) %>%
  plot_ly(x = ~unemploy, y = ~psavert, text = ~date)
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

Find out how to get to the plot below. Some hints:

colorRamp(c("purple", "red", "yellow"))
lag()
colorbar()
add_segments()
ax = c(2500, 7300, 9500, 5500, 8500, 11000)
ay = c(14, 15, 11, 9, 3.5, 8)
format(date, "%b %Y")
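How two of these hints behave in isolation can be checked on a toy vector. The sketch below is only an illustration under stated assumptions: the vectors t and v are invented, and the base-R expression c(NA, head(v, -1)) is used as a stand-in for what dplyr's lag() returns on a plain vector.

```r
# colorRamp() returns a function mapping numbers in [0, 1] to RGB rows
ramp <- colorRamp(c("purple", "red", "yellow"))

# Invented ordered variable standing in for a date index
t <- 1:5
t01 <- (t - min(t)) / (max(t) - min(t))        # rescale to [0, 1]
cols <- rgb(ramp(t01), maxColorValue = 255)    # one hex color per step

# Previous value of a series; stand-in for dplyr::lag(v)
v <- c(10, 12, 11, 15, 14)
v_prev <- c(NA, head(v, -1))

# One row per segment, from the previous point to the current one;
# the first row is dropped because it has no predecessor
segs <- data.frame(x0 = v_prev, x1 = v, col = cols)[-1, ]
segs
```

Pairing each point with its predecessor is what turns a single path into separate segments, and separate segments can each carry their own color.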